Last year, Austin was ranked among the top US cities for homicide rate [1]. Earlier this year, the claim was made that crime has dropped to a new low since 2020 [2]. This analysis investigates both claims and aims to build a better understanding of criminal activity in Austin. While these claims will be kept in mind throughout, I intend to approach the work from an exploratory standpoint: not only to investigate the claims, but to surface any insights about Austin criminal activity and how it has changed over time. What the analysis will not cover, at least not directly, are the causes of changes in crime. I may identify shifts that could inform the cause, but the goal is not to drill down into every potential socioeconomic factor and pick out the primary drivers; it is simply to observe what crime in the city of Austin looks like and what we might expect going forward. Aside from resources outlined in our course resources, I will also draw from this book [3]. The final output of this project, the manuscript you are reading, is rendered in my GitHub analytics portfolio. The original repo dedicated to just this project can be found here.
Description of data and data source
My primary data source will come from here. The data is updated weekly by the Austin Police Department, and each record in the dataset represents an Incident Report, with the highest offense within an incident taking precedence in fields like the description and categorization. Each incident can have other offenses tied to it; however, since each record is a unique incident, only the aforementioned Highest Offense is represented. (NOTE: At the time of this writing, 7/31/2024, the dataset has been taken down to be replaced with one that aligns more closely with the FBI National Incident Based Reporting System. The datasets are not one-to-one, so reproducibility would require more than a lift and shift, though some of the methods could still be employed.)
The raw data is represented by several categorical, location, and time-based variables, many of which have missing values or are formatted incorrectly for their data type, so they need to be cleaned or recoded. After cleaning the data (done in a separate file found in this repo), we can take a more meaningful look.
Code
skim(d1)
Data summary

| Name | d1 |
|---|---|
| Number of rows | 2461621 |
| Number of columns | 28 |
| Column type frequency: character | 12 |
| Column type frequency: Date | 3 |
| Column type frequency: difftime | 2 |
| Column type frequency: numeric | 9 |
| Column type frequency: POSIXct | 2 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Highest.Offense.Description | 0 | 1 | 3 | 48 | 0 | 436 | 0 |
| Family.Violence | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| Location.Type | 0 | 1 | 7 | 47 | 0 | 47 | 0 |
| Address | 0 | 1 | 8 | 74 | 0 | 246951 | 0 |
| APD.Sector | 0 | 1 | 2 | 5 | 0 | 14 | 0 |
| APD.District | 0 | 1 | 1 | 2 | 0 | 21 | 0 |
| PRA | 0 | 1 | 1 | 4 | 0 | 742 | 0 |
| Clearance.Status | 0 | 1 | 0 | 1 | 615856 | 4 | 0 |
| UCR.Category | 0 | 1 | 0 | 3 | 1550375 | 17 | 0 |
| Category.Description | 0 | 1 | 0 | 18 | 1550375 | 8 | 0 |
| Location | 0 | 1 | 0 | 27 | 32335 | 219842 | 0 |
| Crime.Category | 0 | 1 | 4 | 29 | 0 | 33 | 0 |

Variable type: Date

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Occurred.Date | 0 | 1.00 | 2003-01-01 | 2024-06-01 | 2012-05-28 | 7823 |
| Report.Date | 0 | 1.00 | 2002-11-29 | 2024-06-02 | 2012-06-06 | 7825 |
| Clearance.Date | 348308 | 0.86 | 2003-01-01 | 2024-06-02 | 2012-10-17 | 7814 |

Variable type: difftime

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Occurred.Time | 0 | 1 | 0 secs | 86340 secs | 14:25:00 | 1440 |
| Report.Time | 0 | 1 | 0 secs | 86340 secs | 14:06:00 | 1440 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Incident.Number | 0 | 1.00 | 6.031558e+10 | 2.896224e+11 | 20035.00 | 2.005329e+10 | 2.010505e+10 | 2.017186e+10 | 2.024242e+12 | ▇▁▁▁▁ |
| Highest.Offense.Code | 0 | 1.00 | 1.689080e+03 | 1.218280e+03 | 100.00 | 6.010000e+02 | 1.199000e+03 | 2.716000e+03 | 8.905000e+03 | ▇▅▁▁▁ |
| Zip.Code | 0 | 1.00 | 7.873243e+04 | 2.510000e+01 | 76574.00 | 7.871700e+04 | 7.874100e+04 | 7.875200e+04 | 7.875900e+04 | ▁▁▁▁▇ |
| Council.District | 30699 | 0.99 | 4.960000e+00 | 2.840000e+00 | 1.00 | 3.000000e+00 | 4.000000e+00 | 7.000000e+00 | 1.000000e+01 | ▅▇▃▃▅ |
| Census.Tract | 8822 | 1.00 | 2.453700e+02 | 3.363970e+03 | 1.00 | 1.500000e+01 | 2.324000e+01 | 3.380000e+02 | 9.508000e+05 | ▇▁▁▁▁ |
| X.coordinate | 0 | 1.00 | 3.075787e+06 | 3.551571e+05 | 0.00 | 3.108421e+06 | 3.117292e+06 | 3.126595e+06 | 3.231806e+06 | ▁▁▁▁▇ |
| Y.coordinate | 0 | 1.00 | 9.946761e+06 | 1.147895e+06 | 0.00 | 1.005743e+07 | 1.007300e+07 | 1.010056e+07 | 1.021550e+07 | ▁▁▁▁▇ |
| Latitude | 32335 | 0.99 | 3.029000e+01 | 8.000000e-02 | 30.01 | 3.023000e+01 | 3.028000e+01 | 3.035000e+01 | 3.067000e+01 | ▁▇▇▂▁ |
| Longitude | 32335 | 0.99 | -9.773000e+01 | 5.000000e-02 | -98.18 | -9.776000e+01 | -9.773000e+01 | -9.770000e+01 | -9.737000e+01 | ▁▁▇▂▁ |

Variable type: POSIXct

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Occurred.Date.Time | 0 | 1 | 2003-01-01 00:00:00 | 2024-06-01 23:46:00 | 2012-05-28 23:09:00 | 1738386 |
| Report.Date.Time | 0 | 1 | 2002-11-29 05:30:00 | 2024-06-02 01:20:00 | 2012-06-06 11:15:00 | 2169726 |
I preserved quite a few variables from the original dataset, but the primary ones of interest will be Occurred.Date.Time, Zip.Code, Highest.Offense.Description, Category.Description, and Crime.Category. For the purposes of this analysis, the variables pertaining to the type of crime committed are all rolled into the last one, Crime.Category. Crime.Category is a derived field that buckets crimes into a manageable number of categories, enough to understand what crime occurred without getting flooded with minute details. The categorizations were done manually by me; the mapping, along with a description of the method, can be seen in the processing file. The short version: the categories are based on the UCR crime descriptions given by the FBI when available, and otherwise are bucketed similarly from the Highest Offense field via string detection. This brings the number of unique categories from 436 in the Highest Offense field down to 32 in the derived field.
Questions/Hypotheses to be addressed
Ultimately, I would like to explore a few questions:
Has crime truly dropped, or is it expected to rise again as the year progresses?
Has the homicide rate dropped with crime?
Has the homicide rate been a problem for a long time, or was it truly an emerging problem last year?
Schematic of workflow
The intention of this analysis is to be exploratory; while I do have some questions I would like to chase down, I intend for them to be more of a compass than a map. With that said, the general strategy taken can be boiled down as follows:
Check for data quality, understand data representation
Clean the data, address problems identified in previous step, some exploration
More “formal” exploration. Consider what variables might be most important, and how those variables can be represented in the analysis.
Repeat any steps performed above as necessary.
Resolve remaining or initial questions using statistical methods.
Methods
To accomplish the steps described above, a number of techniques and methodologies were employed.
Cleaning: The data had a number of string variables for date fields, so converting these to proper date variables was necessary. Several variables had missing or empty values. Most of these records were removed, since they represented less than 1% of the data and were often fields related to the seriousness of some crimes (like the FBI descriptions and categorizations).
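The actual cleaning was done in R in the processing file; as an illustrative sketch of the same two steps (string-to-date conversion, then dropping the small share of rows missing key categorical fields), here is a pandas analogue on toy data. The column names and date format are assumptions, not the dataset's exact schema.

```python
import pandas as pd

# Toy rows mimicking the raw export: dates stored as strings, one record
# missing its FBI categorization (column names are assumptions)
raw = pd.DataFrame({
    "Occurred.Date": ["01/05/2020", "02/10/2020", "03/15/2020"],
    "UCR.Category": ["Theft", None, "Burglary"],
})

# Convert the string date field to a proper datetime type
raw["Occurred.Date"] = pd.to_datetime(raw["Occurred.Date"], format="%m/%d/%Y")

# Drop the rows missing the categorical fields of interest
clean = raw.dropna(subset=["UCR.Category"])
```
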
Imputing: A few variables offered the opportunity to impute values logically. For example, there were some records with a report date but no occurrence date. By the definitions of the variables, the occurrence date should always be less than or equal to the report date, so in instances where the occurrence date was missing but the report date was available, I imputed the occurrence date with the report date. One could consider this a change in the definition of the variable, to something like "the latest time the crime could have occurred," but for the purposes of this analysis that will suffice.
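This fallback logic can be sketched in one line of pandas (the original imputation was done in R; column names here are assumptions):

```python
import pandas as pd

# Toy records: the second incident has a report date but no occurrence date
d = pd.DataFrame({
    "Occurred.Date": pd.to_datetime(["2020-01-05", None, "2020-03-01"]),
    "Report.Date": pd.to_datetime(["2020-01-06", "2020-02-10", "2020-03-02"]),
})

# A crime cannot occur after it is reported, so the report date serves as
# an upper bound: fill missing occurrence dates with the report date
d["Occurred.Date"] = d["Occurred.Date"].fillna(d["Report.Date"])
```
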
Transformations/Derived Variables: The offense descriptions were difficult to do much with simply because there were so many distinct types. As a result, I ultimately decided to bucket the different types of crimes by their description, using a combination of other categorical variables when available and string detection, effectively creating a lexicon that categorizes the records.
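The string-detection lexicon can be sketched as a keyword lookup; this is a Python analogue with a hypothetical three-category mini-lexicon, not the full 32-category mapping from the processing file:

```python
import pandas as pd

# Hypothetical mini-lexicon: each category keys substrings to detect in
# Highest.Offense.Description (the real mapping covers many more categories)
lexicon = {
    "Murder": ["MURDER", "HOMICIDE"],
    "Theft": ["THEFT", "SHOPLIFT"],
    "Assault": ["ASSAULT"],
}

def bucket(description: str) -> str:
    """Return the first category whose keywords match the description."""
    desc = description.upper()
    for category, keywords in lexicon.items():
        if any(k in desc for k in keywords):
            return category
    return "Other"

offenses = pd.Series(["CAPITAL MURDER", "THEFT OF BICYCLE", "AGG ASSAULT", "DWI"])
categories = offenses.map(bucket)
```

Ordering the lexicon matters: earlier categories take precedence when a description matches several keyword lists, which mirrors the "highest offense takes precedence" convention in the data.
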
Exploratory Analysis: The exploratory analysis was very iterative, first focused on data quality, so a lot of filtering, transforming, and summarizing of the variables discussed above with basic summary and dplyr functions. After cleaning, visualizations like bar charts, choropleth maps, and line graphs were used to understand patterns in the data, largely built with ggplot and plotly. After the statistical analysis was performed, more visualizations were produced in the form of line graphs with forecasts and error ranges.
Statistical Analysis: The primary statistical analysis was a hierarchical ARIMA forecast. Exponential smoothing was also briefly considered, as well as a non-hierarchical ARIMA method. The forecasts were produced primarily with the fable and forecast packages, drawing on material from the book Forecasting: Principles and Practice (3rd Edition) [3].
Results
This analysis focuses on three simple factors when it comes to the occurrence of crime: "when", "where", and "what". Quickly summarizing the results, we see the following.
Starting with "when", we see that crime has generally been decreasing since around 2008, while year-over-year rates of homicide were generally increasing in the years leading up to 2020. The previously mentioned news article claims that crime is decreasing, and the time series plot arguably corroborates this [2]. Still, the claim is worth investigating to see whether the decreases are cyclical, seasonal, or just coincidence. Location variables are also provided in the data, so it is worth exploring "where" the crimes are occurring. Since zip code information is provided, and it is a simple location variable that people are familiar with, that seems like an easy place to start. We will see that crime is most common in a few zip codes in the center of the city, and that over time crime in those areas does appear to be decreasing, while the occurrence of crime in the outer areas is relatively constant.
Finally, the "what" is important simply because we can't celebrate decreases in crime if they come at the cost of more severe crimes happening. We will isolate homicide specifically and observe the trends over time; then, using data from 2003 through 2023, we will attempt to predict the month-over-month occurrence of homicide in 2024, to see how the prediction stacks up against what is actually happening. We ultimately fit a hierarchical seasonal ARIMA model, isolate instances of murder, and see that in the first few months of 2024 there may be a decrease in the rate of homicide compared to what was expected.
Exploratory/Descriptive analysis
A time series plot is a very natural starting point for this exploration. Given that the data is at the incident level, it is worth aggregating the count of occurrences up to some interval of time. We have values all the way down to the reported time of occurrence, but given the imputations needed in the data cleaning step, as well as the questionable reliability of the reported occurrence times to begin with, daily seems like the smallest interval worth aggregating over. Figure 1 below is the corresponding aggregated time series plot. While we can see a general trend, the plot is rather dense since there are so many points plotted along the x-axis. If we want to do anything like isolating specific crimes such as homicide, we will need to aggregate at a coarser grain; otherwise we will have too many intervals with 0 observations.
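Aggregating from the incident level up to daily and monthly counts can be sketched as follows; this is a pandas analogue of the R aggregation (toy timestamps, with the Occurred.Date.Time column name taken from the dataset):

```python
import pandas as pd

# Toy incident-level timestamps standing in for Occurred.Date.Time
incidents = pd.DataFrame({
    "Occurred.Date.Time": pd.to_datetime([
        "2020-01-05 14:25", "2020-01-20 09:10",
        "2020-02-02 23:50", "2020-02-14 11:00", "2020-02-28 03:30",
    ])
})

# Count incidents per calendar day, then per calendar month
daily = incidents.resample("D", on="Occurred.Date.Time").size()
monthly = incidents.resample("MS", on="Occurred.Date.Time").size()
```

Note that daily resampling fills in the empty days with zero counts, which is exactly the sparsity problem described above once you filter down to a rare crime like homicide.
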
Figure 1: Daily Crime Time Series Plot
Figure 2 below is the corresponding time series plot aggregated over each month. We still see a rise-and-fall trend similar to the daily plot, but it is more pronounced since there are fewer outliers than in the daily plot. Beyond the overall trend, we can also see some potential patterns, like a seasonal cycle, and a change in the overall trend around 2018 until around 2020, where it briefly increases again before returning to a general decrease.
Figure 2: Monthly Crime Time Series Plot
Jumping ahead just a bit: after performing the simple lexicon-based classification and properly identifying murder offenses, we can isolate those crimes for their own time series plot. Figure 3 shows exactly this, and the trend is very different from the overall time series. For the most part, the rate of murder is relatively low through 2015; around 2016 we see it spike and then slowly increase until about 2022, where it starts to come back down.
Figure 3: Monthly Murder Line Plot
These changes are interesting, primarily because we see the same shift before 2020 in both overall crime and murder, yet the number of murders is too small to be the primary driver of the overall trend. It is also interesting that the increase in murders starts before the overall increase and continues after the overall trend has begun to decrease again. These shifts may be caused by correlated factors, such as population, economic, or political factors, to which the rate of murder could be more sensitive than other crimes. Furthermore, because the rate of murder is so low relative to other types of crime, its trend is likely always going to be more sensitive to external factors. Once again, the goal of this analysis is not to explain the why, but these observations could make for good future studies into potential causes or correlations. For now, the main question these visuals raise concerns the apparent trends: are they actually material? Is the observed decrease the result of a change in the underlying rate, or merely happenstance, such that we can expect the rate of murder to increase in future months? Before I attempt to explore this question, I want to see what variables might help me explain it. The last chart considers the type of crime, which we will see in a moment, but it does not consider location. For location, I have created a visualization based on zip code that I think can be revealing.